import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import scipy.stats as stats
import warnings
warnings.filterwarnings("ignore")
%matplotlib inline
400/2000
0.2
400/500
0.8
failure rate for the manufactured item is 5% (p) n=10
stats.binom.pmf(k=0,n=10,p=.05)
0.5987369392383787
stats.binom.pmf(k=1,n=10,p=.05)
0.3151247048623047
stats.binom.cdf(k=2,n=10,p=.05)
0.9884964426207031
1-stats.binom.cdf(k=2,n=10,p=.05)
0.01150355737929687
Mean sells is 3
1-stats.poisson.pmf(0,3)
0.950212931632136
stats.poisson.cdf(4,3)-stats.poisson.cdf(1,3)
0.6161149710523164
k = np.arange(0,11)
poisson = stats.poisson.pmf(k,3)
# Plotting graph for 0-10 cars sold
plt.figure(figsize=(10,7))
plt.bar(k, poisson) # make a bar plot
plt.title("Poisson distribution function for average sells per week :3", fontsize=18) # set the title
plt.xlabel("Number of Cars Sold", fontsize=14) # set the x-axis label
plt.ylabel("Probability of cars sold per-week", fontsize=14) # set the y-axis label
plt.show() # display the plot
percentage of orders that are taken correctly is 86.8% (p) n=3
stats.binom.pmf(k=3,n=3,p=0.868)
0.653972032
stats.binom.pmf(k=0,n=3,p=0.868)
0.0022999680000000003
1-stats.binom.cdf(k=1,n=3,p=0.868)
0.952327936
In Amazon, sellers are onboarded on the basis some criteria that needs to be validated. As per the validation performed, sellers are notified with the acceptance or rejection. Once seller is accepted and have completed all the required tasks, seller can explore and advertise his products on the basis of his needs. Seller can perform different metrics checks and ads monitoring like, total number of ad clicks, total cost of ad clicks, search term used by the customer, etc.
Every year there is Amazon Freedom sale where customers can purchase products with some discounts and all. As this sale happens every year, and we have all pre-historic data where we can compare all the pre-data with the new customer registrations.
1) As sellers are accepted and rejected on the basis of validation, we can use binomial distribution to check acceptance ratio, and how many sellers are successfully onboarded by Amazon in a day. We can calculate the probability percentage which can be used to know how many engineers needs to be deployed for the validation and the onboarding process. 2) Amazon can calculate time consumption for validation of seller, on the basis of which engineer's average time frame can be divided and seller validation can be equally distributed. 3) As seller has visibility to monitor and perform different metrics checks, we can use different concepts to increase the sales:
basketball_data = pd.read_csv('Basketball.csv')
basketball_data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 61 entries, 0 to 60 Data columns (total 13 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Team 61 non-null object 1 Tournament 61 non-null int64 2 Score 61 non-null object 3 PlayedGames 61 non-null object 4 WonGames 61 non-null object 5 DrawnGames 61 non-null object 6 LostGames 61 non-null object 7 BasketScored 61 non-null object 8 BasketGiven 61 non-null object 9 TournamentChampion 61 non-null object 10 Runner-up 61 non-null object 11 TeamLaunch 61 non-null object 12 HighestPositionHeld 61 non-null int64 dtypes: int64(2), object(11) memory usage: 6.3+ KB
Basketball.csv contains 2 columns as int type and 11 columns as object type Total columns are 13 Total rows are 61
# Checking few data using head()
basketball_data.head()
| Team | Tournament | Score | PlayedGames | WonGames | DrawnGames | LostGames | BasketScored | BasketGiven | TournamentChampion | Runner-up | TeamLaunch | HighestPositionHeld | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Team 1 | 86 | 4385 | 2762 | 1647 | 552 | 563 | 5947 | 3140 | 33 | 23 | 1929 | 1 |
| 1 | Team 2 | 86 | 4262 | 2762 | 1581 | 573 | 608 | 5900 | 3114 | 25 | 25 | 1929 | 1 |
| 2 | Team 3 | 80 | 3442 | 2614 | 1241 | 598 | 775 | 4534 | 3309 | 10 | 8 | 1929 | 1 |
| 3 | Team 4 | 82 | 3386 | 2664 | 1187 | 616 | 861 | 4398 | 3469 | 6 | 6 | 1931to32 | 1 |
| 4 | Team 5 | 86 | 3368 | 2762 | 1209 | 633 | 920 | 4631 | 3700 | 8 | 7 | 1929 | 1 |
basketball_data.isnull().sum()
Team 0 Tournament 0 Score 0 PlayedGames 0 WonGames 0 DrawnGames 0 LostGames 0 BasketScored 0 BasketGiven 0 TournamentChampion 0 Runner-up 0 TeamLaunch 0 HighestPositionHeld 0 dtype: int64
There is no empty data, so our basketball dataframe don't need any cleansing for NA/NULL values.
duplicates = basketball_data.duplicated()
sum(duplicates)
0
No Duplicates found.
As there is no data cleansing required, so will be describing the data
basketball_data.describe(include="all").T
| count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Team | 61 | 61 | Team 1 | 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Tournament | 61.0 | NaN | NaN | NaN | 24.0 | 26.827225 | 1.0 | 4.0 | 12.0 | 38.0 | 86.0 |
| Score | 61 | 61 | 4385 | 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| PlayedGames | 61 | 53 | 2762 | 3 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| WonGames | 61 | 59 | 7 | 2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| DrawnGames | 61 | 57 | 14 | 2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| LostGames | 61 | 56 | 37 | 3 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| BasketScored | 61 | 60 | 70 | 2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| BasketGiven | 61 | 61 | 3140 | 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| TournamentChampion | 61 | 8 | - | 52 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Runner-up | 61 | 10 | - | 48 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| TeamLaunch | 61 | 47 | 1929 | 10 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| HighestPositionHeld | 61.0 | NaN | NaN | NaN | 7.081967 | 5.276663 | 1.0 | 3.0 | 6.0 | 10.0 | 20.0 |
Using describe():
basketball_data.columns
Index(['Team', 'Tournament', 'Score', 'PlayedGames', 'WonGames', 'DrawnGames',
'LostGames', 'BasketScored', 'BasketGiven', 'TournamentChampion',
'Runner-up', 'TeamLaunch', 'HighestPositionHeld'],
dtype='object')
Like all columns are in Upper Camel Case, so lets change Runner-up to the same format as others.
# Checking whole datasets
basketball_data
| Team | Tournament | Score | PlayedGames | WonGames | DrawnGames | LostGames | BasketScored | BasketGiven | TournamentChampion | Runner-up | TeamLaunch | HighestPositionHeld | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Team 1 | 86 | 4385 | 2762 | 1647 | 552 | 563 | 5947 | 3140 | 33 | 23 | 1929 | 1 |
| 1 | Team 2 | 86 | 4262 | 2762 | 1581 | 573 | 608 | 5900 | 3114 | 25 | 25 | 1929 | 1 |
| 2 | Team 3 | 80 | 3442 | 2614 | 1241 | 598 | 775 | 4534 | 3309 | 10 | 8 | 1929 | 1 |
| 3 | Team 4 | 82 | 3386 | 2664 | 1187 | 616 | 861 | 4398 | 3469 | 6 | 6 | 1931to32 | 1 |
| 4 | Team 5 | 86 | 3368 | 2762 | 1209 | 633 | 920 | 4631 | 3700 | 8 | 7 | 1929 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 56 | Team 57 | 1 | 34 | 38 | 8 | 10 | 20 | 38 | 66 | - | - | 2009-10 | 20 |
| 57 | Team 58 | 1 | 22 | 30 | 7 | 8 | 15 | 37 | 57 | - | - | 1956-57 | 16 |
| 58 | Team 59 | 1 | 19 | 30 | 7 | 5 | 18 | 51 | 85 | - | - | 1951~52 | 16 |
| 59 | Team 60 | 1 | 14 | 30 | 5 | 4 | 21 | 34 | 65 | - | - | 1955-56 | 15 |
| 60 | Team 61 | 1 | - | - | - | - | - | - | - | - | - | 2017~18 | 9 |
61 rows × 13 columns
While analyzing data got some points:
# Renaming column Runner-up to RunnerUp
basketball_data.rename(columns = {'Runner-up': 'RunnerUp'}, inplace = True)
#Let's keep a duplicate copy of this file, as we will be doing some data cleaning and data imputation.
basketball_data_copy = basketball_data.copy(deep=True)
basketball_data_copy.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 61 entries, 0 to 60 Data columns (total 13 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Team 61 non-null object 1 Tournament 61 non-null int64 2 Score 61 non-null object 3 PlayedGames 61 non-null object 4 WonGames 61 non-null object 5 DrawnGames 61 non-null object 6 LostGames 61 non-null object 7 BasketScored 61 non-null object 8 BasketGiven 61 non-null object 9 TournamentChampion 61 non-null object 10 RunnerUp 61 non-null object 11 TeamLaunch 61 non-null object 12 HighestPositionHeld 61 non-null int64 dtypes: int64(2), object(11) memory usage: 6.3+ KB
# Slicing data values of TeamLaunch column
basketball_data['TeamLaunch']=basketball_data['TeamLaunch'].str.slice(0,4)
# Replace all values which is having '-' as value with '0'
basketball_data=basketball_data.replace('-',0)
# Show Data-Set
basketball_data
| Team | Tournament | Score | PlayedGames | WonGames | DrawnGames | LostGames | BasketScored | BasketGiven | TournamentChampion | RunnerUp | TeamLaunch | HighestPositionHeld | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Team 1 | 86 | 4385 | 2762 | 1647 | 552 | 563 | 5947 | 3140 | 33 | 23 | 1929 | 1 |
| 1 | Team 2 | 86 | 4262 | 2762 | 1581 | 573 | 608 | 5900 | 3114 | 25 | 25 | 1929 | 1 |
| 2 | Team 3 | 80 | 3442 | 2614 | 1241 | 598 | 775 | 4534 | 3309 | 10 | 8 | 1929 | 1 |
| 3 | Team 4 | 82 | 3386 | 2664 | 1187 | 616 | 861 | 4398 | 3469 | 6 | 6 | 1931 | 1 |
| 4 | Team 5 | 86 | 3368 | 2762 | 1209 | 633 | 920 | 4631 | 3700 | 8 | 7 | 1929 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 56 | Team 57 | 1 | 34 | 38 | 8 | 10 | 20 | 38 | 66 | 0 | 0 | 2009 | 20 |
| 57 | Team 58 | 1 | 22 | 30 | 7 | 8 | 15 | 37 | 57 | 0 | 0 | 1956 | 16 |
| 58 | Team 59 | 1 | 19 | 30 | 7 | 5 | 18 | 51 | 85 | 0 | 0 | 1951 | 16 |
| 59 | Team 60 | 1 | 14 | 30 | 5 | 4 | 21 | 34 | 65 | 0 | 0 | 1955 | 15 |
| 60 | Team 61 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2017 | 9 |
61 rows × 13 columns
Data cleansing and data imputation has been done, lets make all columns dType as numeric except Team column. After inspecting the rows, Team 61 data isn't sufficient, so we can remove index 60.
columns=basketball_data.drop('Team', axis=1).columns
basketball_data[columns]=basketball_data[columns].apply(pd.to_numeric)
# Add new column named TeamLaunchCategorisation
basketball_data['TeamLaunchCategorisation']=pd.qcut(x=basketball_data.TeamLaunch, q=4, labels=['Oldest', 'Old', 'New', 'Newest'])
basketball_data.drop(60, inplace=True)
basketball_data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 60 entries, 0 to 59 Data columns (total 14 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Team 60 non-null object 1 Tournament 60 non-null int64 2 Score 60 non-null int64 3 PlayedGames 60 non-null int64 4 WonGames 60 non-null int64 5 DrawnGames 60 non-null int64 6 LostGames 60 non-null int64 7 BasketScored 60 non-null int64 8 BasketGiven 60 non-null int64 9 TournamentChampion 60 non-null int64 10 RunnerUp 60 non-null int64 11 TeamLaunch 60 non-null int64 12 HighestPositionHeld 60 non-null int64 13 TeamLaunchCategorisation 60 non-null category dtypes: category(1), int64(12), object(1) memory usage: 6.5+ KB
# Let's inspect/analyze data on last time
basketball_data
| Team | Tournament | Score | PlayedGames | WonGames | DrawnGames | LostGames | BasketScored | BasketGiven | TournamentChampion | RunnerUp | TeamLaunch | HighestPositionHeld | TeamLaunchCategorisation | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Team 1 | 86 | 4385 | 2762 | 1647 | 552 | 563 | 5947 | 3140 | 33 | 23 | 1929 | 1 | Oldest |
| 1 | Team 2 | 86 | 4262 | 2762 | 1581 | 573 | 608 | 5900 | 3114 | 25 | 25 | 1929 | 1 | Oldest |
| 2 | Team 3 | 80 | 3442 | 2614 | 1241 | 598 | 775 | 4534 | 3309 | 10 | 8 | 1929 | 1 | Oldest |
| 3 | Team 4 | 82 | 3386 | 2664 | 1187 | 616 | 861 | 4398 | 3469 | 6 | 6 | 1931 | 1 | Oldest |
| 4 | Team 5 | 86 | 3368 | 2762 | 1209 | 633 | 920 | 4631 | 3700 | 8 | 7 | 1929 | 1 | Oldest |
| 5 | Team 6 | 73 | 2819 | 2408 | 990 | 531 | 887 | 3680 | 3373 | 1 | 4 | 1934 | 1 | Oldest |
| 6 | Team 7 | 82 | 2792 | 2626 | 948 | 608 | 1070 | 3609 | 3889 | 0 | 0 | 1929 | 3 | Oldest |
| 7 | Team 8 | 70 | 2573 | 2302 | 864 | 577 | 861 | 3228 | 3230 | 2 | 3 | 1929 | 1 | Oldest |
| 8 | Team 9 | 58 | 2109 | 1986 | 698 | 522 | 766 | 2683 | 2847 | 0 | 1 | 1939 | 2 | Old |
| 9 | Team 10 | 51 | 1884 | 1728 | 606 | 440 | 682 | 2159 | 2492 | 1 | 0 | 1932 | 1 | Oldest |
| 10 | Team 11 | 45 | 1814 | 1530 | 563 | 392 | 575 | 2052 | 2188 | 1 | 5 | 1941 | 1 | Old |
| 11 | Team 12 | 51 | 1789 | 1698 | 586 | 389 | 723 | 2278 | 2624 | 0 | 0 | 1939 | 4 | Old |
| 12 | Team 13 | 42 | 1471 | 1466 | 463 | 384 | 619 | 1767 | 2180 | 0 | 0 | 1948 | 4 | Old |
| 13 | Team 14 | 44 | 1416 | 1428 | 453 | 336 | 639 | 1843 | 2368 | 0 | 1 | 1929 | 2 | Oldest |
| 14 | Team 15 | 43 | 1389 | 1458 | 471 | 358 | 629 | 1753 | 2152 | 0 | 1 | 1944 | 2 | Old |
| 15 | Team 16 | 37 | 1351 | 1318 | 426 | 327 | 565 | 1500 | 1834 | 0 | 0 | 1935 | 4 | Oldest |
| 16 | Team 17 | 36 | 1314 | 1255 | 390 | 330 | 535 | 1421 | 1763 | 0 | 0 | 1949 | 4 | Old |
| 17 | Team 18 | 38 | 1174 | 1192 | 408 | 292 | 492 | 1642 | 1951 | 0 | 0 | 1933 | 3 | Oldest |
| 18 | Team 19 | 27 | 1148 | 988 | 333 | 256 | 399 | 1182 | 1371 | 0 | 0 | 1960 | 3 | New |
| 19 | Team 20 | 33 | 1020 | 1096 | 367 | 242 | 487 | 1347 | 1746 | 0 | 1 | 1951 | 2 | Old |
| 20 | Team 21 | 17 | 970 | 646 | 266 | 172 | 208 | 892 | 789 | 0 | 1 | 1998 | 2 | Newest |
| 21 | Team 22 | 23 | 667 | 742 | 218 | 175 | 349 | 819 | 1157 | 0 | 0 | 1941 | 6 | Old |
| 22 | Team 23 | 17 | 662 | 652 | 189 | 148 | 305 | 760 | 1088 | 0 | 0 | 1977 | 8 | New |
| 23 | Team 24 | 21 | 606 | 678 | 203 | 180 | 295 | 750 | 1022 | 0 | 0 | 1959 | 5 | New |
| 24 | Team 25 | 12 | 553 | 456 | 147 | 112 | 197 | 520 | 633 | 0 | 0 | 2004 | 6 | Newest |
| 25 | Team 26 | 20 | 538 | 628 | 184 | 149 | 295 | 716 | 1050 | 0 | 0 | 1935 | 5 | Oldest |
| 26 | Team 27 | 13 | 510 | 494 | 155 | 128 | 211 | 619 | 744 | 0 | 0 | 1961 | 5 | New |
| 27 | Team 28 | 18 | 445 | 586 | 145 | 143 | 298 | 607 | 992 | 0 | 0 | 1940 | 11 | Old |
| 28 | Team 29 | 12 | 421 | 380 | 125 | 81 | 174 | 458 | 623 | 0 | 0 | 1930 | 6 | Oldest |
| 29 | Team 30 | 11 | 416 | 402 | 113 | 95 | 194 | 430 | 632 | 0 | 0 | 1963 | 6 | New |
| 30 | Team 31 | 12 | 375 | 423 | 123 | 102 | 198 | 422 | 581 | 0 | 0 | 1974 | 7 | New |
| 31 | Team 32 | 14 | 353 | 426 | 129 | 95 | 202 | 492 | 720 | 0 | 0 | 1943 | 4 | Old |
| 32 | Team 33 | 12 | 343 | 448 | 104 | 127 | 217 | 393 | 662 | 0 | 0 | 1977 | 12 | New |
| 33 | Team 34 | 9 | 293 | 346 | 96 | 92 | 158 | 291 | 489 | 0 | 0 | 1987 | 7 | Newest |
| 34 | Team 35 | 11 | 285 | 334 | 103 | 79 | 152 | 419 | 588 | 0 | 0 | 1941 | 4 | Old |
| 35 | Team 36 | 7 | 277 | 270 | 76 | 76 | 118 | 320 | 410 | 0 | 0 | 1991 | 7 | Newest |
| 36 | Team 37 | 6 | 242 | 228 | 62 | 56 | 110 | 244 | 366 | 0 | 0 | 2007 | 8 | Newest |
| 37 | Team 38 | 9 | 230 | 282 | 82 | 63 | 137 | 285 | 430 | 0 | 0 | 1962 | 5 | New |
| 38 | Team 39 | 4 | 190 | 160 | 52 | 45 | 63 | 199 | 241 | 0 | 0 | 1994 | 10 | Newest |
| 39 | Team 40 | 5 | 188 | 186 | 50 | 46 | 90 | 202 | 296 | 0 | 0 | 1978 | 8 | New |
| 40 | Team 41 | 6 | 168 | 204 | 59 | 50 | 95 | 216 | 310 | 0 | 0 | 1971 | 12 | New |
| 41 | Team 42 | 6 | 150 | 180 | 53 | 44 | 83 | 165 | 221 | 0 | 0 | 1963 | 7 | New |
| 42 | Team 43 | 4 | 148 | 152 | 37 | 37 | 78 | 155 | 253 | 0 | 0 | 1999 | 17 | Newest |
| 43 | Team 44 | 3 | 132 | 114 | 35 | 27 | 52 | 139 | 167 | 0 | 0 | 2014 | 10 | Newest |
| 44 | Team 45 | 7 | 107 | 130 | 43 | 21 | 66 | 227 | 308 | 0 | 0 | 1929 | 3 | Oldest |
| 45 | Team 46 | 3 | 96 | 114 | 26 | 44 | 44 | 101 | 139 | 0 | 0 | 1990 | 9 | Newest |
| 46 | Team 47 | 4 | 91 | 116 | 34 | 16 | 66 | 181 | 295 | 0 | 0 | 1947 | 7 | Old |
| 47 | Team 48 | 2 | 83 | 80 | 20 | 23 | 37 | 62 | 117 | 0 | 0 | 1996 | 17 | Newest |
| 48 | Team 49 | 2 | 81 | 80 | 19 | 24 | 37 | 70 | 115 | 0 | 0 | 1995 | 19 | Newest |
| 49 | Team 50 | 4 | 76 | 108 | 30 | 16 | 62 | 145 | 252 | 0 | 0 | 1945 | 10 | Old |
| 50 | Team 51 | 3 | 71 | 90 | 29 | 13 | 48 | 121 | 183 | 0 | 0 | 1953 | 14 | New |
| 51 | Team 52 | 4 | 56 | 72 | 21 | 14 | 37 | 153 | 184 | 0 | 0 | 1929 | 6 | Oldest |
| 52 | Team 53 | 2 | 52 | 68 | 17 | 18 | 33 | 71 | 116 | 0 | 0 | 1979 | 10 | Newest |
| 53 | Team 54 | 3 | 42 | 54 | 18 | 6 | 30 | 97 | 131 | 0 | 0 | 1929 | 8 | Oldest |
| 54 | Team 55 | 2 | 40 | 68 | 13 | 14 | 41 | 70 | 182 | 0 | 0 | 1950 | 16 | Old |
| 55 | Team 56 | 1 | 35 | 38 | 8 | 11 | 19 | 36 | 55 | 0 | 0 | 2016 | 17 | Newest |
| 56 | Team 57 | 1 | 34 | 38 | 8 | 10 | 20 | 38 | 66 | 0 | 0 | 2009 | 20 | Newest |
| 57 | Team 58 | 1 | 22 | 30 | 7 | 8 | 15 | 37 | 57 | 0 | 0 | 1956 | 16 | New |
| 58 | Team 59 | 1 | 19 | 30 | 7 | 5 | 18 | 51 | 85 | 0 | 0 | 1951 | 16 | Old |
| 59 | Team 60 | 1 | 14 | 30 | 5 | 4 | 21 | 34 | 65 | 0 | 0 | 1955 | 15 | New |
# Let's describe the data
basketball_data.describe(include='all')
| Team | Tournament | Score | PlayedGames | WonGames | DrawnGames | LostGames | BasketScored | BasketGiven | TournamentChampion | RunnerUp | TeamLaunch | HighestPositionHeld | TeamLaunchCategorisation | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 60 | 60.000000 | 60.000000 | 60.000000 | 60.000000 | 60.000000 | 60.000000 | 60.000000 | 60.000000 | 60.00000 | 60.000000 | 60.000000 | 60.000000 | 60 |
| unique | 60 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 4 |
| top | Team 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Oldest |
| freq | 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 17 |
| mean | NaN | 24.383333 | 916.450000 | 810.100000 | 309.033333 | 192.083333 | 308.816667 | 1159.350000 | 1159.233333 | 1.45000 | 1.433333 | 1957.950000 | 7.050000 | NaN |
| std | NaN | 26.884620 | 1138.342899 | 877.465393 | 408.481395 | 201.985508 | 294.508639 | 1512.063948 | 1163.946914 | 5.51554 | 4.574679 | 26.646732 | 5.315232 | NaN |
| min | NaN | 1.000000 | 14.000000 | 30.000000 | 5.000000 | 4.000000 | 15.000000 | 34.000000 | 55.000000 | 0.00000 | 0.000000 | 1929.000000 | 1.000000 | NaN |
| 25% | NaN | 4.000000 | 104.250000 | 115.500000 | 34.750000 | 26.250000 | 62.750000 | 154.500000 | 236.000000 | 0.00000 | 0.000000 | 1934.750000 | 3.000000 | NaN |
| 50% | NaN | 12.000000 | 395.500000 | 424.500000 | 124.000000 | 98.500000 | 197.500000 | 444.000000 | 632.500000 | 0.00000 | 0.000000 | 1950.500000 | 6.000000 | NaN |
| 75% | NaN | 39.000000 | 1360.500000 | 1345.500000 | 432.750000 | 331.500000 | 563.500000 | 1669.750000 | 2001.250000 | 0.00000 | 0.000000 | 1977.250000 | 10.000000 | NaN |
| max | NaN | 86.000000 | 4385.000000 | 2762.000000 | 1647.000000 | 633.000000 | 1070.000000 | 5947.000000 | 3889.000000 | 33.00000 | 25.000000 | 2016.000000 | 20.000000 | NaN |
DataSet has been ready for further analysis. Dataset to be used: basketball_data
Checking the overall distribution of continuous data variables.
columns=basketball_data.drop(['Team','TeamLaunchCategorisation'], axis=1).columns
fig=plt.figure(figsize=(20,10))
for i in range(0,len(columns)):
ax=fig.add_subplot(3,4,i+1)
sns.distplot(basketball_data[columns[i]])
ax.set_title(columns[i],color='orange',weight='bold',fontsize=16)
plt.tight_layout()
plt.show()
Above distplot graph represents that there is skewness in which tends to have tail towards higher values, so this type of skewness is called Right Skewed, that means mean would be greater than median for all values, so it can be termed as both Right Skewed or Positive Skewed.
# Comparison of mean and median to check mean and median fact
basketball_data.mean()>basketball_data.median()
Tournament True Score True PlayedGames True WonGames True DrawnGames True LostGames True BasketScored True BasketGiven True TournamentChampion True RunnerUp True TeamLaunch True HighestPositionHeld True dtype: bool
Hence, its proved that dataset's values tends to be right skewed.
As now, we have seen the skewness of graph using distplot, now let's see the 5-Number Summary plot or BoxPlot for each data
columns=basketball_data.drop(['Team','TeamLaunchCategorisation'], axis=1).columns
fig=plt.figure(figsize=(20,10))
for i in range(0,len(columns)):
ax=fig.add_subplot(3,4,i+1)
sns.boxplot(basketball_data[columns[i]])
ax.set_title(columns[i],color='orange',weight='bold',fontsize=16)
plt.tight_layout()
plt.show()
Describing above graphical representation:
As now, we have seen the skewness of graph distplot and bocplot, now let's see histplot with statistical data for each column
columns=basketball_data.drop(['Team','TeamLaunchCategorisation'], axis=1).columns
fig=plt.figure(figsize=(20,10))
for i in range(0,len(columns)):
ax=fig.add_subplot(3,4,i+1)
plt.hist(basketball_data[columns[i]])
plt.axvline(basketball_data[columns[i]].mean(), c='g')
plt.axvline(basketball_data[columns[i]].median(), c='r')
plt.axvline(basketball_data[columns[i]].mode()[0], c='y')
ax.set_title(columns[i],color='orange',weight='bold',fontsize=16)
plt.tight_layout()
plt.show()
Describing above graphical representation:
plt.figure(figsize=(10,5))
sns.countplot(data=basketball_data, x='TeamLaunchCategorisation')
plt.xlabel('TeamLaunchCategorisation', weight='bold', fontsize=16)
plt.ylabel('Count', weight='bold', fontsize=16)
plt.show()
Describing above graphical representation
Same can be represented by pie-chart as below
fig = px.pie(basketball_data, values=basketball_data.TeamLaunchCategorisation.value_counts(), names=basketball_data.TeamLaunchCategorisation.value_counts().index.categories,
color_discrete_sequence=px.colors.sequential.RdBu, title='TeamLaunchCategorisation', hole=0.3)
fig.show()
Let's make graphical representation with Team as category and PlayedGames, WonGames, DrawnGames, LostGames as numerical
basketball_data_sorted_by_played_games=basketball_data.sort_values('PlayedGames', ascending=False)
plt.figure(figsize=(35,15))
sns.scatterplot(data=basketball_data_sorted_by_played_games, x='Team', y='PlayedGames', hue='TeamLaunchCategorisation')
plt.xticks(rotation=45, weight='bold', fontsize=12)
plt.xlabel('Team', weight='bold', fontsize=16)
plt.ylabel('PlayedGames', weight='bold', fontsize=16)
for i in range(0, len(basketball_data_sorted_by_played_games.TeamLaunch)):
plt.annotate(basketball_data_sorted_by_played_games.TeamLaunch[i], (basketball_data_sorted_by_played_games.Team[i], basketball_data_sorted_by_played_games.PlayedGames[i]))
plt.title('Team vs PlayedGames with TeamLaunch and TeamLaunchCategorisation as bifurcation', weight='bold', fontsize=30)
plt.legend(prop={'size': 20})
plt.show()
Describing above graphical representation:
To plot graph for win, lost, draw, champion and runnerup games, we have to convert the data values in standard form, i.e., (expectedValue/totalValue)*100
basketball_data['Win']=np.round((basketball_data.WonGames/basketball_data.PlayedGames)*100,2)
basketball_data['Lost']=np.round((basketball_data.LostGames/basketball_data.PlayedGames)*100,2)
basketball_data['Draw']=np.round((basketball_data.DrawnGames/basketball_data.PlayedGames)*100,2)
basketball_data['Champion%']=np.round((basketball_data.TournamentChampion/basketball_data.Tournament)*100,2)
basketball_data['RunnerUp%']=np.round((basketball_data.TournamentChampion/basketball_data.Tournament)*100,2)
basketball_data
| Team | Tournament | Score | PlayedGames | WonGames | DrawnGames | LostGames | BasketScored | BasketGiven | TournamentChampion | RunnerUp | TeamLaunch | HighestPositionHeld | TeamLaunchCategorisation | Win | Lost | Draw | Champion% | RunnerUp% | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Team 1 | 86 | 4385 | 2762 | 1647 | 552 | 563 | 5947 | 3140 | 33 | 23 | 1929 | 1 | Oldest | 59.63 | 20.38 | 19.99 | 38.37 | 38.37 |
| 1 | Team 2 | 86 | 4262 | 2762 | 1581 | 573 | 608 | 5900 | 3114 | 25 | 25 | 1929 | 1 | Oldest | 57.24 | 22.01 | 20.75 | 29.07 | 29.07 |
| 2 | Team 3 | 80 | 3442 | 2614 | 1241 | 598 | 775 | 4534 | 3309 | 10 | 8 | 1929 | 1 | Oldest | 47.48 | 29.65 | 22.88 | 12.50 | 12.50 |
| 3 | Team 4 | 82 | 3386 | 2664 | 1187 | 616 | 861 | 4398 | 3469 | 6 | 6 | 1931 | 1 | Oldest | 44.56 | 32.32 | 23.12 | 7.32 | 7.32 |
| 4 | Team 5 | 86 | 3368 | 2762 | 1209 | 633 | 920 | 4631 | 3700 | 8 | 7 | 1929 | 1 | Oldest | 43.77 | 33.31 | 22.92 | 9.30 | 9.30 |
| 5 | Team 6 | 73 | 2819 | 2408 | 990 | 531 | 887 | 3680 | 3373 | 1 | 4 | 1934 | 1 | Oldest | 41.11 | 36.84 | 22.05 | 1.37 | 1.37 |
| 6 | Team 7 | 82 | 2792 | 2626 | 948 | 608 | 1070 | 3609 | 3889 | 0 | 0 | 1929 | 3 | Oldest | 36.10 | 40.75 | 23.15 | 0.00 | 0.00 |
| 7 | Team 8 | 70 | 2573 | 2302 | 864 | 577 | 861 | 3228 | 3230 | 2 | 3 | 1929 | 1 | Oldest | 37.53 | 37.40 | 25.07 | 2.86 | 2.86 |
| 8 | Team 9 | 58 | 2109 | 1986 | 698 | 522 | 766 | 2683 | 2847 | 0 | 1 | 1939 | 2 | Old | 35.15 | 38.57 | 26.28 | 0.00 | 0.00 |
| 9 | Team 10 | 51 | 1884 | 1728 | 606 | 440 | 682 | 2159 | 2492 | 1 | 0 | 1932 | 1 | Oldest | 35.07 | 39.47 | 25.46 | 1.96 | 1.96 |
| 10 | Team 11 | 45 | 1814 | 1530 | 563 | 392 | 575 | 2052 | 2188 | 1 | 5 | 1941 | 1 | Old | 36.80 | 37.58 | 25.62 | 2.22 | 2.22 |
| 11 | Team 12 | 51 | 1789 | 1698 | 586 | 389 | 723 | 2278 | 2624 | 0 | 0 | 1939 | 4 | Old | 34.51 | 42.58 | 22.91 | 0.00 | 0.00 |
| 12 | Team 13 | 42 | 1471 | 1466 | 463 | 384 | 619 | 1767 | 2180 | 0 | 0 | 1948 | 4 | Old | 31.58 | 42.22 | 26.19 | 0.00 | 0.00 |
| 13 | Team 14 | 44 | 1416 | 1428 | 453 | 336 | 639 | 1843 | 2368 | 0 | 1 | 1929 | 2 | Oldest | 31.72 | 44.75 | 23.53 | 0.00 | 0.00 |
| 14 | Team 15 | 43 | 1389 | 1458 | 471 | 358 | 629 | 1753 | 2152 | 0 | 1 | 1944 | 2 | Old | 32.30 | 43.14 | 24.55 | 0.00 | 0.00 |
| 15 | Team 16 | 37 | 1351 | 1318 | 426 | 327 | 565 | 1500 | 1834 | 0 | 0 | 1935 | 4 | Oldest | 32.32 | 42.87 | 24.81 | 0.00 | 0.00 |
| 16 | Team 17 | 36 | 1314 | 1255 | 390 | 330 | 535 | 1421 | 1763 | 0 | 0 | 1949 | 4 | Old | 31.08 | 42.63 | 26.29 | 0.00 | 0.00 |
| 17 | Team 18 | 38 | 1174 | 1192 | 408 | 292 | 492 | 1642 | 1951 | 0 | 0 | 1933 | 3 | Oldest | 34.23 | 41.28 | 24.50 | 0.00 | 0.00 |
| 18 | Team 19 | 27 | 1148 | 988 | 333 | 256 | 399 | 1182 | 1371 | 0 | 0 | 1960 | 3 | New | 33.70 | 40.38 | 25.91 | 0.00 | 0.00 |
| 19 | Team 20 | 33 | 1020 | 1096 | 367 | 242 | 487 | 1347 | 1746 | 0 | 1 | 1951 | 2 | Old | 33.49 | 44.43 | 22.08 | 0.00 | 0.00 |
| 20 | Team 21 | 17 | 970 | 646 | 266 | 172 | 208 | 892 | 789 | 0 | 1 | 1998 | 2 | Newest | 41.18 | 32.20 | 26.63 | 0.00 | 0.00 |
| 21 | Team 22 | 23 | 667 | 742 | 218 | 175 | 349 | 819 | 1157 | 0 | 0 | 1941 | 6 | Old | 29.38 | 47.04 | 23.58 | 0.00 | 0.00 |
| 22 | Team 23 | 17 | 662 | 652 | 189 | 148 | 305 | 760 | 1088 | 0 | 0 | 1977 | 8 | New | 28.99 | 46.78 | 22.70 | 0.00 | 0.00 |
| 23 | Team 24 | 21 | 606 | 678 | 203 | 180 | 295 | 750 | 1022 | 0 | 0 | 1959 | 5 | New | 29.94 | 43.51 | 26.55 | 0.00 | 0.00 |
| 24 | Team 25 | 12 | 553 | 456 | 147 | 112 | 197 | 520 | 633 | 0 | 0 | 2004 | 6 | Newest | 32.24 | 43.20 | 24.56 | 0.00 | 0.00 |
| 25 | Team 26 | 20 | 538 | 628 | 184 | 149 | 295 | 716 | 1050 | 0 | 0 | 1935 | 5 | Oldest | 29.30 | 46.97 | 23.73 | 0.00 | 0.00 |
| 26 | Team 27 | 13 | 510 | 494 | 155 | 128 | 211 | 619 | 744 | 0 | 0 | 1961 | 5 | New | 31.38 | 42.71 | 25.91 | 0.00 | 0.00 |
| 27 | Team 28 | 18 | 445 | 586 | 145 | 143 | 298 | 607 | 992 | 0 | 0 | 1940 | 11 | Old | 24.74 | 50.85 | 24.40 | 0.00 | 0.00 |
| 28 | Team 29 | 12 | 421 | 380 | 125 | 81 | 174 | 458 | 623 | 0 | 0 | 1930 | 6 | Oldest | 32.89 | 45.79 | 21.32 | 0.00 | 0.00 |
| 29 | Team 30 | 11 | 416 | 402 | 113 | 95 | 194 | 430 | 632 | 0 | 0 | 1963 | 6 | New | 28.11 | 48.26 | 23.63 | 0.00 | 0.00 |
| 30 | Team 31 | 12 | 375 | 423 | 123 | 102 | 198 | 422 | 581 | 0 | 0 | 1974 | 7 | New | 29.08 | 46.81 | 24.11 | 0.00 | 0.00 |
| 31 | Team 32 | 14 | 353 | 426 | 129 | 95 | 202 | 492 | 720 | 0 | 0 | 1943 | 4 | Old | 30.28 | 47.42 | 22.30 | 0.00 | 0.00 |
| 32 | Team 33 | 12 | 343 | 448 | 104 | 127 | 217 | 393 | 662 | 0 | 0 | 1977 | 12 | New | 23.21 | 48.44 | 28.35 | 0.00 | 0.00 |
| 33 | Team 34 | 9 | 293 | 346 | 96 | 92 | 158 | 291 | 489 | 0 | 0 | 1987 | 7 | Newest | 27.75 | 45.66 | 26.59 | 0.00 | 0.00 |
| 34 | Team 35 | 11 | 285 | 334 | 103 | 79 | 152 | 419 | 588 | 0 | 0 | 1941 | 4 | Old | 30.84 | 45.51 | 23.65 | 0.00 | 0.00 |
| 35 | Team 36 | 7 | 277 | 270 | 76 | 76 | 118 | 320 | 410 | 0 | 0 | 1991 | 7 | Newest | 28.15 | 43.70 | 28.15 | 0.00 | 0.00 |
| 36 | Team 37 | 6 | 242 | 228 | 62 | 56 | 110 | 244 | 366 | 0 | 0 | 2007 | 8 | Newest | 27.19 | 48.25 | 24.56 | 0.00 | 0.00 |
| 37 | Team 38 | 9 | 230 | 282 | 82 | 63 | 137 | 285 | 430 | 0 | 0 | 1962 | 5 | New | 29.08 | 48.58 | 22.34 | 0.00 | 0.00 |
| 38 | Team 39 | 4 | 190 | 160 | 52 | 45 | 63 | 199 | 241 | 0 | 0 | 1994 | 10 | Newest | 32.50 | 39.38 | 28.12 | 0.00 | 0.00 |
| 39 | Team 40 | 5 | 188 | 186 | 50 | 46 | 90 | 202 | 296 | 0 | 0 | 1978 | 8 | New | 26.88 | 48.39 | 24.73 | 0.00 | 0.00 |
| 40 | Team 41 | 6 | 168 | 204 | 59 | 50 | 95 | 216 | 310 | 0 | 0 | 1971 | 12 | New | 28.92 | 46.57 | 24.51 | 0.00 | 0.00 |
| 41 | Team 42 | 6 | 150 | 180 | 53 | 44 | 83 | 165 | 221 | 0 | 0 | 1963 | 7 | New | 29.44 | 46.11 | 24.44 | 0.00 | 0.00 |
| 42 | Team 43 | 4 | 148 | 152 | 37 | 37 | 78 | 155 | 253 | 0 | 0 | 1999 | 17 | Newest | 24.34 | 51.32 | 24.34 | 0.00 | 0.00 |
| 43 | Team 44 | 3 | 132 | 114 | 35 | 27 | 52 | 139 | 167 | 0 | 0 | 2014 | 10 | Newest | 30.70 | 45.61 | 23.68 | 0.00 | 0.00 |
| 44 | Team 45 | 7 | 107 | 130 | 43 | 21 | 66 | 227 | 308 | 0 | 0 | 1929 | 3 | Oldest | 33.08 | 50.77 | 16.15 | 0.00 | 0.00 |
| 45 | Team 46 | 3 | 96 | 114 | 26 | 44 | 44 | 101 | 139 | 0 | 0 | 1990 | 9 | Newest | 22.81 | 38.60 | 38.60 | 0.00 | 0.00 |
| 46 | Team 47 | 4 | 91 | 116 | 34 | 16 | 66 | 181 | 295 | 0 | 0 | 1947 | 7 | Old | 29.31 | 56.90 | 13.79 | 0.00 | 0.00 |
| 47 | Team 48 | 2 | 83 | 80 | 20 | 23 | 37 | 62 | 117 | 0 | 0 | 1996 | 17 | Newest | 25.00 | 46.25 | 28.75 | 0.00 | 0.00 |
| 48 | Team 49 | 2 | 81 | 80 | 19 | 24 | 37 | 70 | 115 | 0 | 0 | 1995 | 19 | Newest | 23.75 | 46.25 | 30.00 | 0.00 | 0.00 |
| 49 | Team 50 | 4 | 76 | 108 | 30 | 16 | 62 | 145 | 252 | 0 | 0 | 1945 | 10 | Old | 27.78 | 57.41 | 14.81 | 0.00 | 0.00 |
| 50 | Team 51 | 3 | 71 | 90 | 29 | 13 | 48 | 121 | 183 | 0 | 0 | 1953 | 14 | New | 32.22 | 53.33 | 14.44 | 0.00 | 0.00 |
| 51 | Team 52 | 4 | 56 | 72 | 21 | 14 | 37 | 153 | 184 | 0 | 0 | 1929 | 6 | Oldest | 29.17 | 51.39 | 19.44 | 0.00 | 0.00 |
| 52 | Team 53 | 2 | 52 | 68 | 17 | 18 | 33 | 71 | 116 | 0 | 0 | 1979 | 10 | Newest | 25.00 | 48.53 | 26.47 | 0.00 | 0.00 |
| 53 | Team 54 | 3 | 42 | 54 | 18 | 6 | 30 | 97 | 131 | 0 | 0 | 1929 | 8 | Oldest | 33.33 | 55.56 | 11.11 | 0.00 | 0.00 |
| 54 | Team 55 | 2 | 40 | 68 | 13 | 14 | 41 | 70 | 182 | 0 | 0 | 1950 | 16 | Old | 19.12 | 60.29 | 20.59 | 0.00 | 0.00 |
| 55 | Team 56 | 1 | 35 | 38 | 8 | 11 | 19 | 36 | 55 | 0 | 0 | 2016 | 17 | Newest | 21.05 | 50.00 | 28.95 | 0.00 | 0.00 |
| 56 | Team 57 | 1 | 34 | 38 | 8 | 10 | 20 | 38 | 66 | 0 | 0 | 2009 | 20 | Newest | 21.05 | 52.63 | 26.32 | 0.00 | 0.00 |
| 57 | Team 58 | 1 | 22 | 30 | 7 | 8 | 15 | 37 | 57 | 0 | 0 | 1956 | 16 | New | 23.33 | 50.00 | 26.67 | 0.00 | 0.00 |
| 58 | Team 59 | 1 | 19 | 30 | 7 | 5 | 18 | 51 | 85 | 0 | 0 | 1951 | 16 | Old | 23.33 | 60.00 | 16.67 | 0.00 | 0.00 |
| 59 | Team 60 | 1 | 14 | 30 | 5 | 4 | 21 | 34 | 65 | 0 | 0 | 1955 | 15 | New | 16.67 | 70.00 | 13.33 | 0.00 | 0.00 |
basketball_data_sorted_by_won_games=basketball_data.sort_values('Win', ascending=False)
plt.figure(figsize=(35,15))
sns.scatterplot(data=basketball_data_sorted_by_won_games, x='Team', y='Win', hue='TeamLaunchCategorisation')
plt.xticks(rotation=45, weight='bold', fontsize=12)
plt.xlabel('Team', weight='bold', fontsize=16)
plt.ylabel('WonGames %', weight='bold', fontsize=16)
for i in range(0, len(basketball_data_sorted_by_won_games.TeamLaunch)):
plt.annotate(basketball_data_sorted_by_won_games.TeamLaunch[i], (basketball_data_sorted_by_won_games.Team[i], basketball_data_sorted_by_won_games.Win[i]))
plt.title('Team vs WonGames % with TeamLaunch and TeamLaunchCategorisation as bifurcation', weight='bold', fontsize=30)
plt.legend(prop={'size': 20})
plt.show()
Describing above graphical representation:
basketball_data_sorted_by_lost_games=basketball_data.sort_values('Lost', ascending=False)
plt.figure(figsize=(35,15))
sns.scatterplot(data=basketball_data_sorted_by_lost_games, x='Team', y='Lost', hue='TeamLaunchCategorisation')
plt.xticks(rotation=45, weight='bold', fontsize=12)
plt.xlabel('Team', weight='bold', fontsize=16)
plt.ylabel('LostGames %', weight='bold', fontsize=16)
for i in range(0, len(basketball_data_sorted_by_lost_games.TeamLaunch)):
plt.annotate(basketball_data_sorted_by_lost_games.TeamLaunch[i], (basketball_data_sorted_by_lost_games.Team[i], basketball_data_sorted_by_lost_games.Lost[i]))
plt.title('Team vs LostGames % with TeamLaunch and TeamLaunchCategorisation as bifurcation', weight='bold', fontsize=30)
plt.legend(prop={'size': 20})
plt.show()
Describing above graphical representation:
basketball_data_sorted_by_drawn_games=basketball_data.sort_values('Draw', ascending=False)
plt.figure(figsize=(35,15))
sns.scatterplot(data=basketball_data_sorted_by_drawn_games, x='Team', y='Draw', hue='TeamLaunchCategorisation')
plt.xticks(rotation=45, weight='bold', fontsize=12)
plt.xlabel('Team', weight='bold', fontsize=16)
plt.ylabel('DrawnGames %', weight='bold', fontsize=16)
for i in range(0, len(basketball_data_sorted_by_drawn_games.TeamLaunch)):
plt.annotate(basketball_data_sorted_by_drawn_games.TeamLaunch[i], (basketball_data_sorted_by_drawn_games.Team[i], basketball_data_sorted_by_drawn_games.Draw[i]))
plt.title('Team vs DrawnGames % with TeamLaunch and TeamLaunchCategorisation as bifurcation', weight='bold', fontsize=30)
plt.legend(prop={'size': 20})
plt.show()
Describing above graphical representation:
# Graphical Representation for Team Category and Games Played
basketball_data_team_categorisation=basketball_data.sort_values('PlayedGames').groupby('TeamLaunchCategorisation')['PlayedGames'].sum().reset_index()
plt.figure(figsize=(15,5))
sns.barplot(x ='TeamLaunchCategorisation',y='PlayedGames',data = basketball_data_team_categorisation)
plt.xlabel('TeamLaunchCategorisation', weight='bold', fontsize=16)
plt.ylabel('PlayedGames', weight='bold', fontsize=16)
plt.title("PlayedGames across Team Category",size=15)
plt.show()
Describing above graphical representation:
basketball_data
| Team | Tournament | Score | PlayedGames | WonGames | DrawnGames | LostGames | BasketScored | BasketGiven | TournamentChampion | RunnerUp | TeamLaunch | HighestPositionHeld | TeamLaunchCategorisation | Win | Lost | Draw | Champion% | RunnerUp% | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Team 1 | 86 | 4385 | 2762 | 1647 | 552 | 563 | 5947 | 3140 | 33 | 23 | 1929 | 1 | Oldest | 59.63 | 20.38 | 19.99 | 38.37 | 38.37 |
| 1 | Team 2 | 86 | 4262 | 2762 | 1581 | 573 | 608 | 5900 | 3114 | 25 | 25 | 1929 | 1 | Oldest | 57.24 | 22.01 | 20.75 | 29.07 | 29.07 |
| 2 | Team 3 | 80 | 3442 | 2614 | 1241 | 598 | 775 | 4534 | 3309 | 10 | 8 | 1929 | 1 | Oldest | 47.48 | 29.65 | 22.88 | 12.50 | 12.50 |
| 3 | Team 4 | 82 | 3386 | 2664 | 1187 | 616 | 861 | 4398 | 3469 | 6 | 6 | 1931 | 1 | Oldest | 44.56 | 32.32 | 23.12 | 7.32 | 7.32 |
| 4 | Team 5 | 86 | 3368 | 2762 | 1209 | 633 | 920 | 4631 | 3700 | 8 | 7 | 1929 | 1 | Oldest | 43.77 | 33.31 | 22.92 | 9.30 | 9.30 |
| 5 | Team 6 | 73 | 2819 | 2408 | 990 | 531 | 887 | 3680 | 3373 | 1 | 4 | 1934 | 1 | Oldest | 41.11 | 36.84 | 22.05 | 1.37 | 1.37 |
| 6 | Team 7 | 82 | 2792 | 2626 | 948 | 608 | 1070 | 3609 | 3889 | 0 | 0 | 1929 | 3 | Oldest | 36.10 | 40.75 | 23.15 | 0.00 | 0.00 |
| 7 | Team 8 | 70 | 2573 | 2302 | 864 | 577 | 861 | 3228 | 3230 | 2 | 3 | 1929 | 1 | Oldest | 37.53 | 37.40 | 25.07 | 2.86 | 2.86 |
| 8 | Team 9 | 58 | 2109 | 1986 | 698 | 522 | 766 | 2683 | 2847 | 0 | 1 | 1939 | 2 | Old | 35.15 | 38.57 | 26.28 | 0.00 | 0.00 |
| 9 | Team 10 | 51 | 1884 | 1728 | 606 | 440 | 682 | 2159 | 2492 | 1 | 0 | 1932 | 1 | Oldest | 35.07 | 39.47 | 25.46 | 1.96 | 1.96 |
| 10 | Team 11 | 45 | 1814 | 1530 | 563 | 392 | 575 | 2052 | 2188 | 1 | 5 | 1941 | 1 | Old | 36.80 | 37.58 | 25.62 | 2.22 | 2.22 |
| 11 | Team 12 | 51 | 1789 | 1698 | 586 | 389 | 723 | 2278 | 2624 | 0 | 0 | 1939 | 4 | Old | 34.51 | 42.58 | 22.91 | 0.00 | 0.00 |
| 12 | Team 13 | 42 | 1471 | 1466 | 463 | 384 | 619 | 1767 | 2180 | 0 | 0 | 1948 | 4 | Old | 31.58 | 42.22 | 26.19 | 0.00 | 0.00 |
| 13 | Team 14 | 44 | 1416 | 1428 | 453 | 336 | 639 | 1843 | 2368 | 0 | 1 | 1929 | 2 | Oldest | 31.72 | 44.75 | 23.53 | 0.00 | 0.00 |
| 14 | Team 15 | 43 | 1389 | 1458 | 471 | 358 | 629 | 1753 | 2152 | 0 | 1 | 1944 | 2 | Old | 32.30 | 43.14 | 24.55 | 0.00 | 0.00 |
| 15 | Team 16 | 37 | 1351 | 1318 | 426 | 327 | 565 | 1500 | 1834 | 0 | 0 | 1935 | 4 | Oldest | 32.32 | 42.87 | 24.81 | 0.00 | 0.00 |
| 16 | Team 17 | 36 | 1314 | 1255 | 390 | 330 | 535 | 1421 | 1763 | 0 | 0 | 1949 | 4 | Old | 31.08 | 42.63 | 26.29 | 0.00 | 0.00 |
| 17 | Team 18 | 38 | 1174 | 1192 | 408 | 292 | 492 | 1642 | 1951 | 0 | 0 | 1933 | 3 | Oldest | 34.23 | 41.28 | 24.50 | 0.00 | 0.00 |
| 18 | Team 19 | 27 | 1148 | 988 | 333 | 256 | 399 | 1182 | 1371 | 0 | 0 | 1960 | 3 | New | 33.70 | 40.38 | 25.91 | 0.00 | 0.00 |
| 19 | Team 20 | 33 | 1020 | 1096 | 367 | 242 | 487 | 1347 | 1746 | 0 | 1 | 1951 | 2 | Old | 33.49 | 44.43 | 22.08 | 0.00 | 0.00 |
| 20 | Team 21 | 17 | 970 | 646 | 266 | 172 | 208 | 892 | 789 | 0 | 1 | 1998 | 2 | Newest | 41.18 | 32.20 | 26.63 | 0.00 | 0.00 |
| 21 | Team 22 | 23 | 667 | 742 | 218 | 175 | 349 | 819 | 1157 | 0 | 0 | 1941 | 6 | Old | 29.38 | 47.04 | 23.58 | 0.00 | 0.00 |
| 22 | Team 23 | 17 | 662 | 652 | 189 | 148 | 305 | 760 | 1088 | 0 | 0 | 1977 | 8 | New | 28.99 | 46.78 | 22.70 | 0.00 | 0.00 |
| 23 | Team 24 | 21 | 606 | 678 | 203 | 180 | 295 | 750 | 1022 | 0 | 0 | 1959 | 5 | New | 29.94 | 43.51 | 26.55 | 0.00 | 0.00 |
| 24 | Team 25 | 12 | 553 | 456 | 147 | 112 | 197 | 520 | 633 | 0 | 0 | 2004 | 6 | Newest | 32.24 | 43.20 | 24.56 | 0.00 | 0.00 |
| 25 | Team 26 | 20 | 538 | 628 | 184 | 149 | 295 | 716 | 1050 | 0 | 0 | 1935 | 5 | Oldest | 29.30 | 46.97 | 23.73 | 0.00 | 0.00 |
| 26 | Team 27 | 13 | 510 | 494 | 155 | 128 | 211 | 619 | 744 | 0 | 0 | 1961 | 5 | New | 31.38 | 42.71 | 25.91 | 0.00 | 0.00 |
| 27 | Team 28 | 18 | 445 | 586 | 145 | 143 | 298 | 607 | 992 | 0 | 0 | 1940 | 11 | Old | 24.74 | 50.85 | 24.40 | 0.00 | 0.00 |
| 28 | Team 29 | 12 | 421 | 380 | 125 | 81 | 174 | 458 | 623 | 0 | 0 | 1930 | 6 | Oldest | 32.89 | 45.79 | 21.32 | 0.00 | 0.00 |
| 29 | Team 30 | 11 | 416 | 402 | 113 | 95 | 194 | 430 | 632 | 0 | 0 | 1963 | 6 | New | 28.11 | 48.26 | 23.63 | 0.00 | 0.00 |
| 30 | Team 31 | 12 | 375 | 423 | 123 | 102 | 198 | 422 | 581 | 0 | 0 | 1974 | 7 | New | 29.08 | 46.81 | 24.11 | 0.00 | 0.00 |
| 31 | Team 32 | 14 | 353 | 426 | 129 | 95 | 202 | 492 | 720 | 0 | 0 | 1943 | 4 | Old | 30.28 | 47.42 | 22.30 | 0.00 | 0.00 |
| 32 | Team 33 | 12 | 343 | 448 | 104 | 127 | 217 | 393 | 662 | 0 | 0 | 1977 | 12 | New | 23.21 | 48.44 | 28.35 | 0.00 | 0.00 |
| 33 | Team 34 | 9 | 293 | 346 | 96 | 92 | 158 | 291 | 489 | 0 | 0 | 1987 | 7 | Newest | 27.75 | 45.66 | 26.59 | 0.00 | 0.00 |
| 34 | Team 35 | 11 | 285 | 334 | 103 | 79 | 152 | 419 | 588 | 0 | 0 | 1941 | 4 | Old | 30.84 | 45.51 | 23.65 | 0.00 | 0.00 |
| 35 | Team 36 | 7 | 277 | 270 | 76 | 76 | 118 | 320 | 410 | 0 | 0 | 1991 | 7 | Newest | 28.15 | 43.70 | 28.15 | 0.00 | 0.00 |
| 36 | Team 37 | 6 | 242 | 228 | 62 | 56 | 110 | 244 | 366 | 0 | 0 | 2007 | 8 | Newest | 27.19 | 48.25 | 24.56 | 0.00 | 0.00 |
| 37 | Team 38 | 9 | 230 | 282 | 82 | 63 | 137 | 285 | 430 | 0 | 0 | 1962 | 5 | New | 29.08 | 48.58 | 22.34 | 0.00 | 0.00 |
| 38 | Team 39 | 4 | 190 | 160 | 52 | 45 | 63 | 199 | 241 | 0 | 0 | 1994 | 10 | Newest | 32.50 | 39.38 | 28.12 | 0.00 | 0.00 |
| 39 | Team 40 | 5 | 188 | 186 | 50 | 46 | 90 | 202 | 296 | 0 | 0 | 1978 | 8 | New | 26.88 | 48.39 | 24.73 | 0.00 | 0.00 |
| 40 | Team 41 | 6 | 168 | 204 | 59 | 50 | 95 | 216 | 310 | 0 | 0 | 1971 | 12 | New | 28.92 | 46.57 | 24.51 | 0.00 | 0.00 |
| 41 | Team 42 | 6 | 150 | 180 | 53 | 44 | 83 | 165 | 221 | 0 | 0 | 1963 | 7 | New | 29.44 | 46.11 | 24.44 | 0.00 | 0.00 |
| 42 | Team 43 | 4 | 148 | 152 | 37 | 37 | 78 | 155 | 253 | 0 | 0 | 1999 | 17 | Newest | 24.34 | 51.32 | 24.34 | 0.00 | 0.00 |
| 43 | Team 44 | 3 | 132 | 114 | 35 | 27 | 52 | 139 | 167 | 0 | 0 | 2014 | 10 | Newest | 30.70 | 45.61 | 23.68 | 0.00 | 0.00 |
| 44 | Team 45 | 7 | 107 | 130 | 43 | 21 | 66 | 227 | 308 | 0 | 0 | 1929 | 3 | Oldest | 33.08 | 50.77 | 16.15 | 0.00 | 0.00 |
| 45 | Team 46 | 3 | 96 | 114 | 26 | 44 | 44 | 101 | 139 | 0 | 0 | 1990 | 9 | Newest | 22.81 | 38.60 | 38.60 | 0.00 | 0.00 |
| 46 | Team 47 | 4 | 91 | 116 | 34 | 16 | 66 | 181 | 295 | 0 | 0 | 1947 | 7 | Old | 29.31 | 56.90 | 13.79 | 0.00 | 0.00 |
| 47 | Team 48 | 2 | 83 | 80 | 20 | 23 | 37 | 62 | 117 | 0 | 0 | 1996 | 17 | Newest | 25.00 | 46.25 | 28.75 | 0.00 | 0.00 |
| 48 | Team 49 | 2 | 81 | 80 | 19 | 24 | 37 | 70 | 115 | 0 | 0 | 1995 | 19 | Newest | 23.75 | 46.25 | 30.00 | 0.00 | 0.00 |
| 49 | Team 50 | 4 | 76 | 108 | 30 | 16 | 62 | 145 | 252 | 0 | 0 | 1945 | 10 | Old | 27.78 | 57.41 | 14.81 | 0.00 | 0.00 |
| 50 | Team 51 | 3 | 71 | 90 | 29 | 13 | 48 | 121 | 183 | 0 | 0 | 1953 | 14 | New | 32.22 | 53.33 | 14.44 | 0.00 | 0.00 |
| 51 | Team 52 | 4 | 56 | 72 | 21 | 14 | 37 | 153 | 184 | 0 | 0 | 1929 | 6 | Oldest | 29.17 | 51.39 | 19.44 | 0.00 | 0.00 |
| 52 | Team 53 | 2 | 52 | 68 | 17 | 18 | 33 | 71 | 116 | 0 | 0 | 1979 | 10 | Newest | 25.00 | 48.53 | 26.47 | 0.00 | 0.00 |
| 53 | Team 54 | 3 | 42 | 54 | 18 | 6 | 30 | 97 | 131 | 0 | 0 | 1929 | 8 | Oldest | 33.33 | 55.56 | 11.11 | 0.00 | 0.00 |
| 54 | Team 55 | 2 | 40 | 68 | 13 | 14 | 41 | 70 | 182 | 0 | 0 | 1950 | 16 | Old | 19.12 | 60.29 | 20.59 | 0.00 | 0.00 |
| 55 | Team 56 | 1 | 35 | 38 | 8 | 11 | 19 | 36 | 55 | 0 | 0 | 2016 | 17 | Newest | 21.05 | 50.00 | 28.95 | 0.00 | 0.00 |
| 56 | Team 57 | 1 | 34 | 38 | 8 | 10 | 20 | 38 | 66 | 0 | 0 | 2009 | 20 | Newest | 21.05 | 52.63 | 26.32 | 0.00 | 0.00 |
| 57 | Team 58 | 1 | 22 | 30 | 7 | 8 | 15 | 37 | 57 | 0 | 0 | 1956 | 16 | New | 23.33 | 50.00 | 26.67 | 0.00 | 0.00 |
| 58 | Team 59 | 1 | 19 | 30 | 7 | 5 | 18 | 51 | 85 | 0 | 0 | 1951 | 16 | Old | 23.33 | 60.00 | 16.67 | 0.00 | 0.00 |
| 59 | Team 60 | 1 | 14 | 30 | 5 | 4 | 21 | 34 | 65 | 0 | 0 | 1955 | 15 | New | 16.67 | 70.00 | 13.33 | 0.00 | 0.00 |
columns=['Win','Draw','Lost','Champion%','RunnerUp%','HighestPositionHeld']
basketball_data_sorted_by_team_launch=basketball_data.sort_values('TeamLaunch')
fig=plt.figure(figsize=(30,20))
for i in range(0,len(columns)):
ax=fig.add_subplot(6,1,i+1)
sns.lineplot(data=basketball_data_sorted_by_team_launch, x='Team', y=columns[i], hue='TeamLaunchCategorisation', marker='D')
ax.set_title('Performance Report of team for '+columns[i] ,color='orange',weight='bold',fontsize=16)
plt.xlabel('Team', weight='bold', fontsize=12)
plt.ylabel(columns[i], weight='bold', fontsize=12)
plt.xticks(rotation=45)
plt.legend(loc=0)
plt.tight_layout()
plt.show()
Team performance has been defined in above graphical representation where we can see team specific win, loose, draw and more datas
basketball_data_group_lost_mean=basketball_data.groupby(['TeamLaunchCategorisation'])['Lost'].mean().reset_index()
basketball_data_group_draw_mean=basketball_data.groupby(['TeamLaunchCategorisation'])['Draw'].mean().reset_index()
basketball_data_group_win_mean=basketball_data.groupby(['TeamLaunchCategorisation'])['Win'].mean().reset_index()
basketball_data_group_champ_mean=basketball_data.groupby(['TeamLaunchCategorisation'])['Champion%'].mean().reset_index()
basketball_data_group_runner_mean=basketball_data.groupby(['TeamLaunchCategorisation'])['RunnerUp%'].mean().reset_index()
basketball_data_group_pos_mean=basketball_data.groupby(['TeamLaunchCategorisation'])['HighestPositionHeld'].mean().reset_index()
data_list=[basketball_data_group_win_mean,basketball_data_group_lost_mean,basketball_data_group_draw_mean, basketball_data_group_champ_mean, basketball_data_group_runner_mean, basketball_data_group_pos_mean]
columns=['Win','Lost','Draw','Champion%','RunnerUp%','HighestPositionHeld']
fig=plt.figure(figsize=(15,15))
for i in range(0,len(data_list)):
ax=fig.add_subplot(6,1,i+1)
sns.lineplot(data=data_list[i], x='TeamLaunchCategorisation', y=columns[i], marker='H')
ax.set_title('TeamLaunchCategorisation Avg '+columns[i] ,color='orange',weight='bold',fontsize=16)
plt.xlabel('Team', weight='bold', fontsize=12)
plt.ylabel(columns[i], weight='bold', fontsize=12)
plt.tight_layout()
plt.show()
Describing above graphical representation
plt.figure(figsize=(15,7))
sns.pairplot(basketball_data[['Team', 'Tournament', 'Score', 'PlayedGames', 'WonGames', 'DrawnGames', 'LostGames', 'BasketScored', 'BasketGiven', 'TournamentChampion', 'RunnerUp', 'TeamLaunch', 'HighestPositionHeld']], corner=True)
plt.show();
<Figure size 1080x504 with 0 Axes>
Describing above graphical representation
plt.figure(figsize=(15,7))
sns.heatmap(basketball_data[['Team', 'Tournament', 'Score', 'PlayedGames', 'WonGames', 'DrawnGames', 'LostGames', 'BasketScored', 'BasketGiven', 'TournamentChampion', 'RunnerUp', 'TeamLaunch', 'HighestPositionHeld']].corr(), fmt='.2f',annot=True)
plt.show();
Describing above graphical representation
company=pd.read_csv('CompanyX_EU.csv')
company.head()
| Startup | Product | Funding | Event | Result | OperatingState | |
|---|---|---|---|---|---|---|
| 0 | 2600Hz | 2600hz.com | NaN | Disrupt SF 2013 | Contestant | Operating |
| 1 | 3DLT | 3dlt.com | $630K | Disrupt NYC 2013 | Contestant | Closed |
| 2 | 3DPrinterOS | 3dprinteros.com | NaN | Disrupt SF 2016 | Contestant | Operating |
| 3 | 3Dprintler | 3dprintler.com | $1M | Disrupt NY 2016 | Audience choice | Operating |
| 4 | 42 Technologies | 42technologies.com | NaN | Disrupt NYC 2013 | Contestant | Operating |
company.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 662 entries, 0 to 661 Data columns (total 6 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Startup 662 non-null object 1 Product 656 non-null object 2 Funding 448 non-null object 3 Event 662 non-null object 4 Result 662 non-null object 5 OperatingState 662 non-null object dtypes: object(6) memory usage: 31.2+ KB
6 columns are there with Dtype as object 662 entires are present
company.isnull().sum()
Startup 0 Product 6 Funding 214 Event 0 Result 0 OperatingState 0 dtype: int64
Total 220 null values are present, for which Funding has 214 as null and Product has 6 as null.
company_wo_null=company.dropna()
company_wo_null.reset_index(drop=True, inplace=True)
company_wo_null.isnull().sum()
Startup 0 Product 0 Funding 0 Event 0 Result 0 OperatingState 0 dtype: int64
All null values are dropped.
company_wo_null.loc[:,'FundsInMillion']=company_wo_null['Funding'].apply(lambda x: float(x[1:-1])/1000 if x[-1] == 'K' else (float(x[1:-1])*1000 if x[-1] == 'B' else float(x[1:-1])))
company_wo_null
| Startup | Product | Funding | Event | Result | OperatingState | FundsInMillion | |
|---|---|---|---|---|---|---|---|
| 0 | 3DLT | 3dlt.com | $630K | Disrupt NYC 2013 | Contestant | Closed | 0.63 |
| 1 | 3Dprintler | 3dprintler.com | $1M | Disrupt NY 2016 | Audience choice | Operating | 1.00 |
| 2 | 5to1 | 5to1.com | $19.3M | TC50 2009 | Contestant | Acquired | 19.30 |
| 3 | 8 Securities | 8securities.com | $29M | Disrupt Beijing 2011 | Finalist | Operating | 29.00 |
| 4 | AdhereTech | adheretech.com | $1.8M | Hardware Battlefield 2014 | Contestant | Operating | 1.80 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 441 | Zivity | zivity.com | $8M | TC40 2007 | Contestant | Operating | 8.00 |
| 442 | Zmorph | zmorph3d.com | $1M | - | Audience choice | Operating | 1.00 |
| 443 | Zocdoc | zocdoc.com | $223M | TC40 2007 | Contestant | Operating | 223.00 |
| 444 | Zula | zulaapp.com | $3.4M | Disrupt SF 2013 | Audience choice | Operating | 3.40 |
| 445 | Zumper | zumper.com | $31.5M | Disrupt SF 2012 | Finalist | Operating | 31.50 |
446 rows × 7 columns
company_wo_null.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 446 entries, 0 to 445 Data columns (total 7 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Startup 446 non-null object 1 Product 446 non-null object 2 Funding 446 non-null object 3 Event 446 non-null object 4 Result 446 non-null object 5 OperatingState 446 non-null object 6 FundsInMillion 446 non-null float64 dtypes: float64(1), object(6) memory usage: 24.5+ KB
FundsInMillion column has been added with cleansed values and data type as numeric.
box_graph=px.box(company_wo_null, x='FundsInMillion')
box_graph.show()
There are ample number of outliers present in dataset for funds
q1=np.quantile(company_wo_null.FundsInMillion, 0.25)
q3=np.quantile(company_wo_null.FundsInMillion, 0.75)
cut_off=1.5*(q3-q1)
right_whisker= q3 + 1.5*(q3-q1)
company_wo_null[company_wo_null.FundsInMillion>right_whisker].FundsInMillion.count()
60
Total 60 values are present in FundsInMillion which are outliers.
company_wo_null.FundsInMillion.describe()
count 446.000000 mean 17.241489 std 90.483710 min 0.005000 25% 0.745250 50% 2.200000 75% 9.475000 max 1700.000000 Name: FundsInMillion, dtype: float64
To remove outliers, will be using IQR formula, as there are extreme outliers present. Q3 is 9.475 and max is 1700, which is very huge difference, so won't be using z-score.
#Replace every outlier on the upper side by the upper whisker
company_iqr=company_wo_null.copy(deep=True)
for i in np.where(company_iqr.FundsInMillion > q3 + cut_off)[0]:
whisker = q3 + cut_off
company_iqr.loc[i,'FundsInMillion'] = whisker
company_iqr
| Startup | Product | Funding | Event | Result | OperatingState | FundsInMillion | |
|---|---|---|---|---|---|---|---|
| 0 | 3DLT | 3dlt.com | $630K | Disrupt NYC 2013 | Contestant | Closed | 0.630000 |
| 1 | 3Dprintler | 3dprintler.com | $1M | Disrupt NY 2016 | Audience choice | Operating | 1.000000 |
| 2 | 5to1 | 5to1.com | $19.3M | TC50 2009 | Contestant | Acquired | 19.300000 |
| 3 | 8 Securities | 8securities.com | $29M | Disrupt Beijing 2011 | Finalist | Operating | 22.569625 |
| 4 | AdhereTech | adheretech.com | $1.8M | Hardware Battlefield 2014 | Contestant | Operating | 1.800000 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 441 | Zivity | zivity.com | $8M | TC40 2007 | Contestant | Operating | 8.000000 |
| 442 | Zmorph | zmorph3d.com | $1M | - | Audience choice | Operating | 1.000000 |
| 443 | Zocdoc | zocdoc.com | $223M | TC40 2007 | Contestant | Operating | 22.569625 |
| 444 | Zula | zulaapp.com | $3.4M | Disrupt SF 2013 | Audience choice | Operating | 3.400000 |
| 445 | Zumper | zumper.com | $31.5M | Disrupt SF 2012 | Finalist | Operating | 22.569625 |
446 rows × 7 columns
box_graph=px.box(company_iqr, x='FundsInMillion')
box_graph.show()
All outliers have been removed from FundsInMillion.
plt.figure(figsize=(15,7))
sns.countplot(data=company_iqr, x='OperatingState')
plt.xlabel('OperatingState', weight='bold', fontsize=12)
plt.show()
Describing above graphical representation
company_iqr_group_operating_state=company_iqr.groupby(['OperatingState'])['FundsInMillion'].sum().reset_index()
hist_graph=px.histogram(company_iqr_group_operating_state, x='OperatingState', y='FundsInMillion',title='OperatingState vs FundsRaised')
hist_graph.update_layout(xaxis_title="OperatingState", yaxis_title="FundsInMillion", title_x=0.5)
hist_graph.show()
Company whose status is operating has highest fundsRaised.
Step-1: Null Hypothesis: Ho: There is no significant difference between Funds raised by companies that are still operating vs companies that closed down. Alternative Hypothesis: Ha: There is significant difference between Funds raised by companies that are still operating vs companies that closed down.
Step-2: Level of Significance, alpha =0.05(default value)
Step-3: As 2 independent samples are present for FundsRaised for operating and closed, so we can use 2 sample T test.
# Step-4: Calculate p_val
t_stat, p_val=stats.ttest_ind(company_iqr[company_iqr.OperatingState=='Operating']['FundsInMillion'], company_iqr[company_iqr.OperatingState=='Closed']['FundsInMillion'], equal_var=False)
t_stat, p_val
(4.039228118032694, 9.67724782248064e-05)
Step-5: As p_val(9.677e-05) is smaller than level of significant(0.05), we can reject the null hypothesis. Yes, there is significant difference between Funds raised by companies that are still operating vs companies that closed down.
company_copy=company.copy(deep=True)
company_copy
| Startup | Product | Funding | Event | Result | OperatingState | |
|---|---|---|---|---|---|---|
| 0 | 2600Hz | 2600hz.com | NaN | Disrupt SF 2013 | Contestant | Operating |
| 1 | 3DLT | 3dlt.com | $630K | Disrupt NYC 2013 | Contestant | Closed |
| 2 | 3DPrinterOS | 3dprinteros.com | NaN | Disrupt SF 2016 | Contestant | Operating |
| 3 | 3Dprintler | 3dprintler.com | $1M | Disrupt NY 2016 | Audience choice | Operating |
| 4 | 42 Technologies | 42technologies.com | NaN | Disrupt NYC 2013 | Contestant | Operating |
| ... | ... | ... | ... | ... | ... | ... |
| 657 | Zivity | zivity.com | $8M | TC40 2007 | Contestant | Operating |
| 658 | Zmorph | zmorph3d.com | $1M | - | Audience choice | Operating |
| 659 | Zocdoc | zocdoc.com | $223M | TC40 2007 | Contestant | Operating |
| 660 | Zula | zulaapp.com | $3.4M | Disrupt SF 2013 | Audience choice | Operating |
| 661 | Zumper | zumper.com | $31.5M | Disrupt SF 2012 | Finalist | Operating |
662 rows × 6 columns
plt.figure(figsize=(15,7))
sns.countplot(data=company_copy, x='Result')
plt.xlabel('Result Variables', weight='bold', fontsize=14)
plt.show()
Describing above graphical representation
total_operating_count=company_copy.OperatingState.value_counts().get('Operating')
company_copy[(company_copy.Result=='Winner') & (company_copy.OperatingState=='Operating')]['Result'].count()/total_operating_count, company_copy[(company_copy.Result=='Contestant') & (company_copy.OperatingState=='Operating')]['Result'].count()/total_operating_count
(0.04086021505376344, 0.7139784946236559)
4.08% are winners that are still operating 71.3% are contestants that are still operating
Step-1 Null Hypothesis, Ho: No difference in proportion of companies that are operating between winners and contestants Alternative Hypothesis, Ha: There is difference in proportion of companies that are operating between winners and contestants
Step-2 Level of significance, alpha=0.05(default)
Step-3 We can use Chi-Square test of independence as these are categorical data. Let's build contigency table
# Creation of contigency table for further checks
contigency_table=pd.crosstab(company_copy.OperatingState, company_copy.Result, margins=True)
contigency_table
| Result | Audience choice | Contestant | Finalist | Runner up | Winner | All |
|---|---|---|---|---|---|---|
| OperatingState | ||||||
| Acquired | 0 | 62 | 15 | 2 | 7 | 86 |
| Closed | 9 | 90 | 7 | 0 | 0 | 106 |
| Ipo | 0 | 4 | 1 | 0 | 0 | 5 |
| Operating | 32 | 332 | 61 | 21 | 19 | 465 |
| All | 41 | 488 | 84 | 23 | 26 | 662 |
# Step-4: Calculate p_val
chi2, p_val, dof, exp_freq=stats.chi2_contingency(contigency_table, correction=False)
chi2, p_val, dof, exp_freq
(27.904309161964264,
0.1116839954038626,
20,
array([[5.32628399e+00, 6.33957704e+01, 1.09123867e+01, 2.98791541e+00,
3.37764350e+00, 8.60000000e+01],
[6.56495468e+00, 7.81389728e+01, 1.34501511e+01, 3.68277946e+00,
4.16314199e+00, 1.06000000e+02],
[3.09667674e-01, 3.68580060e+00, 6.34441088e-01, 1.73716012e-01,
1.96374622e-01, 5.00000000e+00],
[2.87990937e+01, 3.42779456e+02, 5.90030211e+01, 1.61555891e+01,
1.82628399e+01, 4.65000000e+02],
[4.10000000e+01, 4.88000000e+02, 8.40000000e+01, 2.30000000e+01,
2.60000000e+01, 6.62000000e+02]]))
Step-5: As p_val(0.1116) is greater than level of significant(0.05), we can't reject the null hypothesis. Hence, no difference in proportion of companies that are operating between winners and contestants
# Getting all year from events for further checks
company_copy_event=company_copy.copy(deep=True)
company_copy_event.loc[:,'EventYear']=company_copy_event.Event.apply(lambda x: x[-5:] if x != '-' else 0).astype(int)
company_copy_event
| Startup | Product | Funding | Event | Result | OperatingState | EventYear | |
|---|---|---|---|---|---|---|---|
| 0 | 2600Hz | 2600hz.com | NaN | Disrupt SF 2013 | Contestant | Operating | 2013 |
| 1 | 3DLT | 3dlt.com | $630K | Disrupt NYC 2013 | Contestant | Closed | 2013 |
| 2 | 3DPrinterOS | 3dprinteros.com | NaN | Disrupt SF 2016 | Contestant | Operating | 2016 |
| 3 | 3Dprintler | 3dprintler.com | $1M | Disrupt NY 2016 | Audience choice | Operating | 2016 |
| 4 | 42 Technologies | 42technologies.com | NaN | Disrupt NYC 2013 | Contestant | Operating | 2013 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 657 | Zivity | zivity.com | $8M | TC40 2007 | Contestant | Operating | 2007 |
| 658 | Zmorph | zmorph3d.com | $1M | - | Audience choice | Operating | 0 |
| 659 | Zocdoc | zocdoc.com | $223M | TC40 2007 | Contestant | Operating | 2007 |
| 660 | Zula | zulaapp.com | $3.4M | Disrupt SF 2013 | Audience choice | Operating | 2013 |
| 661 | Zumper | zumper.com | $31.5M | Disrupt SF 2012 | Finalist | Operating | 2012 |
662 rows × 7 columns
company_copy_event[(company_copy_event.EventYear>=2013) & (company_copy_event.Event.str.contains('Disrupt'))].reset_index()
| index | Startup | Product | Funding | Event | Result | OperatingState | EventYear | |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2600Hz | 2600hz.com | NaN | Disrupt SF 2013 | Contestant | Operating | 2013 |
| 1 | 1 | 3DLT | 3dlt.com | $630K | Disrupt NYC 2013 | Contestant | Closed | 2013 |
| 2 | 2 | 3DPrinterOS | 3dprinteros.com | NaN | Disrupt SF 2016 | Contestant | Operating | 2016 |
| 3 | 3 | 3Dprintler | 3dprintler.com | $1M | Disrupt NY 2016 | Audience choice | Operating | 2016 |
| 4 | 4 | 42 Technologies | 42technologies.com | NaN | Disrupt NYC 2013 | Contestant | Operating | 2013 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 271 | 646 | YayPay Inc | yaypay.com | $900K | Disrupt London 2015 | Contestant | Operating | 2015 |
| 272 | 648 | YOOBIC | yoobic.com | NaN | Disrupt London 2015 | Finalist | Operating | 2015 |
| 273 | 653 | ZAP! | zapreklam.com/ | NaN | Disrupt EU 2014 | Audience choice | Operating | 2014 |
| 274 | 656 | Zenefits | zenefits.com | $583.6M | Disrupt NYC 2013 | Finalist | Operating | 2013 |
| 275 | 660 | Zula | zulaapp.com | $3.4M | Disrupt SF 2013 | Audience choice | Operating | 2013 |
276 rows × 8 columns
276 companies are present having ‘disrupt’ keyword from 2013 onwards.